
Conversation

@labbott (Contributor) commented Oct 2, 2025

No description provided.

@morlandi7 morlandi7 linked an issue Oct 13, 2025 that may be closed by this pull request
@labbott labbott force-pushed the reconfigurator_cosmo branch from 164893c to ee4ae51 Compare October 15, 2025 16:48
Laura Abbott added 2 commits October 16, 2025 13:25
@karencfv (Contributor) left a comment

Thanks for addressing my comment! I have a few questions.

Has this been tested on a real cosmo yet? Is it even possible at this point?

Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.

Here we give each component a very generous amount of time to finish its post-update actions. In the simplest of cases, like the SP, we just reset the device, but others, like the RoT Bootloader and the Host OS, have several steps.

// Timeouts, timeouts: always wrong!
//
// We have to pick some maximum time we're willing to wait for the `post_update`
// hook to complete. In general this hook is responsible for resetting the
// updated target to cause it to boot into its new version, but the details vary
// wildly by device type (e.g., post-update RoT bootloader actions require
// multiple RoT resets) and the expected amount of time also varies wildly
// (e.g., resetting a gimlet SP takes a few seconds, resetting a sidecar SP can
// take 10s of seconds, resetting a sled after a host OS update takes minutes).
fn post_update_timeout(update: &PendingMgsUpdate) -> Duration {
    match &update.details {
        PendingMgsUpdateDetails::Sp { .. } => {
            // We're resetting an SP; use a generous timeout for sleds and power
            // shelf controllers (which should take a few seconds) and an even
            // more generous timeout for switches (which we've seen take 10-20
            // seconds in practice).
            match update.sp_type {
                SpType::Sled | SpType::Power => Duration::from_secs(60),
                SpType::Switch => Duration::from_secs(120),
            }
        }
        PendingMgsUpdateDetails::Rot { .. } => {
            // Resetting the RoT should be quick (a few seconds), but we wait
            // for boot info after the reset.
            Duration::from_secs(90)
        }
        PendingMgsUpdateDetails::RotBootloader { .. } => {
            // Resetting the bootloader requires multiple RoT resets; give this
            // a longer timeout.
            Duration::from_secs(210)
        }
        PendingMgsUpdateDetails::HostPhase1(..) => {
            // Resetting a sled takes several minutes (mostly DRAM training);
            // give something very generous here to wait for it to come back.
            Duration::from_secs(30 * 60)
        }
    }
}

Here we poll the device every 10 seconds to see if it has been reset.

if try_reset {
    // We retry this until the component update has been successfully
    // updated, or we get some error *other* than a communication error or
    // some other transient error. There is intentionally no timeout here.
    // If we've staged an update but not managed to reset the device,
    // there's no point where we'd want to stop trying to do so.
    while let Err(error) =
        update_helper.post_update(log, &mut mgs_clients, update).await
    {
        if error.is_fatal() {
            let error = InlineErrorChain::new(&error);
            error!(log, "post_update failed"; &error);
            return Err(ApplyUpdateError::SpComponentResetFailed(
                error.to_string(),
            ));
        }
        // We only care whether the update has completed. We ignore all
        // pre-check errors because they could all be transient if a reset
        // is in the process of happening.
        if let Ok(PrecheckStatus::UpdateComplete) =
            update_helper.precheck(log, &mut mgs_clients, update).await
        {
            break;
        }
        tokio::time::sleep(RESET_DELAY_INTERVAL).await;
    }
}
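The loop above sleeps for `RESET_DELAY_INTERVAL` between prechecks; the constant's name comes from the quoted code, and given the 10-second polling cadence described above, its definition presumably looks something like this (a sketch, the definition site is an assumption):

```rust
use std::time::Duration;

// Sketch of the polling interval used by the retry loop above. The prose
// says the device is polled every 10 seconds, so the constant presumably
// looks like this in the reconfigurator code.
const RESET_DELAY_INTERVAL: Duration = Duration::from_secs(10);

fn main() {
    // A fatal post_update error aborts the loop immediately; any other
    // error leads to another precheck after this interval elapses.
    println!("poll every {} seconds", RESET_DELAY_INTERVAL.as_secs());
}
```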

For the RoT and RoT bootloader we wait a specific amount of time for boot info to show up

pub const WAIT_FOR_BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(120);
const WAIT_FOR_BOOT_INFO_INTERVAL: Duration = Duration::from_secs(10);

/// Poll the RoT asking for its boot information. This confirms that the RoT
/// has been successfully reset.
pub async fn wait_for_boot_info(
    log: &Logger,
    mgs_clients: &mut MgsClients,
    sp_type: SpType,
    sp_slot: u16,
    timeout: Duration,
) -> Result<RotState, PostUpdateError> {
    let before = Instant::now();
    loop {
        debug!(log, "waiting for boot info to confirm a successful reset");
        match mgs_clients
            .try_all_serially(log, |mgs_client| async move {
                mgs_client
                    .sp_rot_boot_info(
                        &sp_type,
                        sp_slot,
                        SpComponent::ROT.const_as_str(),
                        &GetRotBootInfoParams {
                            version: RotBootInfo::HIGHEST_KNOWN_VERSION,
                        },
                    )
                    .await
            })
            .await
        {
            Ok(state) => match state.clone() {
                RotState::V2 { .. } | RotState::V3 { .. } => {
                    debug!(log, "successfully retrieved boot info");
                    return Ok(state.into_inner());
                }
                // The RoT is probably still booting
                RotState::CommunicationFailed { message } => {
                    if before.elapsed() >= timeout {
                        error!(
                            log,
                            "failed to get RoT boot info";
                            "error" => %message
                        );
                        return Err(PostUpdateError::FatalError {
                            error: message,
                        });
                    }
                    info!(
                        log,
                        "failed getting RoT boot info (will retry)";
                        "error" => %message,
                    );
                    tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await;
                }
            },
            // The RoT might still be booting
            Err(error) => {
                let e = InlineErrorChain::new(&error);
                if before.elapsed() >= timeout {
                    error!(
                        log,
                        "failed to get RoT boot info";
                        &e,
                    );
                    return Err(PostUpdateError::FatalError {
                        error: e.to_string(),
                    });
                }
                info!(
                    log,
                    "failed getting RoT boot info (will retry)";
                    e,
                );
                tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await;
            }
        }
    }
}

/// Poll the RoT asking for its boot information. This is used to check
/// the state for RoT bootloader image errors after the RoT is reset.
async fn wait_for_stage0_next_image_check(
    log: &Logger,
    mgs_clients: &mut MgsClients,
    sp_type: SpType,
    sp_slot: u16,
    timeout: Duration,
) -> Result<Option<RotImageError>, PostUpdateError> {
    debug!(log, "attempting to verify image validity");
    match wait_for_boot_info(log, mgs_clients, sp_type, sp_slot, timeout).await
    {
        Ok(state) => match state {
            // The minimum we will ever return is v3.
            // Additionally, V2 does not report image errors, so we cannot
            // know with certainty if a signature check came back with errors
            RotState::V2 { .. } => {
                let error = "unexpected RoT version: 2".to_string();
                error!(
                    log,
                    "failed to get RoT boot info";
                    "error" => &error
                );
                return Err(PostUpdateError::FatalError { error });
            }
            RotState::V3 { stage0next_error, .. } => {
                debug!(log, "successfully completed an image signature check");
                return Ok(stage0next_error);
            }
            // This is unreachable because wait_for_boot_info loops for some
            // time if it encounters `CommunicationFailed`, and if it hits the
            // timeout, it will return an error.
            RotState::CommunicationFailed { message } => {
                error!(
                    log,
                    "failed to get RoT boot info";
                    "error" => %message
                );
                return Err(PostUpdateError::FatalError { error: message });
            }
        },
        Err(error) => return Err(error),
    }
}

// We wait for boot info to ensure a successful reset
wait_for_boot_info(
    log,
    mgs_clients,
    update.sp_type,
    update.slot_id,
    WAIT_FOR_BOOT_INFO_TIMEOUT,
)
.await?;

Sorry for the late review btw! I was out for a couple of weeks

Comment on lines +968 to +972
// We assume a valid model ID for sleds
model: match sp_id.type_ {
    SpType::Sled => GIMLET_SLED_MODEL.to_string(),
    _ => format!("dummy_{}", sp_id.type_),
},
Contributor

If it's not too complex, it would be really nice to have some tests using a cosmo (unit or reconfigurator-cli ones).

-    "913-0000023" => Some(Self::Cosmo),
-    "913-0000019" | "913-0000006" => Some(Self::Gimlet),
+    COSMO_SLED_MODEL => Some(Self::Cosmo),
+    GIMLET_SLED_MODEL | "913-0000006" => Some(Self::Gimlet),
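The refactor in the diff above replaces raw model strings with named constants; their values follow from the "before" lines of the diff. A minimal sketch of the resulting match (the `SledKind` enum name here is a stand-in for illustration, not necessarily the type in the PR):

```rust
// Constant values inferred from the "before" lines of the diff above.
pub const COSMO_SLED_MODEL: &str = "913-0000023";
pub const GIMLET_SLED_MODEL: &str = "913-0000019";

// Stand-in enum for illustration; the PR matches on `Self` in context.
#[derive(Debug, PartialEq)]
enum SledKind {
    Gimlet,
    Cosmo,
}

impl SledKind {
    fn from_model(model: &str) -> Option<Self> {
        match model {
            COSMO_SLED_MODEL => Some(Self::Cosmo),
            // "913-0000006" is carried over from the original match arm;
            // the discussion below questions where that ID comes from.
            GIMLET_SLED_MODEL | "913-0000006" => Some(Self::Gimlet),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(SledKind::from_model("913-0000023"), Some(SledKind::Cosmo));
    assert_eq!(SledKind::from_model("913-0000019"), Some(SledKind::Gimlet));
    assert_eq!(SledKind::from_model("unknown"), None);
}
```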
Contributor

I realise this was already here, but is 913-0000006 just a Gimlet ID that is used for test or legacy devices? Or is this used elsewhere? Do you mind adding a comment so we know where this ID comes from, and where it is used?

Contributor Author

hmmm, I actually think that's incorrect now that I'm double checking it.

@dancrossnyc When you were working on #9016 where did you grab "913-0000006" from? At least according to pilot that's a sidecar serial https://github.com/oxidecomputer/pilot/blob/main/common/src/sp.rs#L135

@labbott (Contributor, Author) commented Nov 4, 2025

Thanks for addressing my comment! I have a few questions.

Has this been tested on a real cosmo yet? Is it even possible at this point?

Unfortunately we've been blocked on the Cosmo device ID. Without a device ID we can't connect via sprockets, which means we can't run RSS, which means we can't run any actual services to run reconfigurator. This is also why I've let this PR sit here.

Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.

This is a slightly tricky question to answer. We know that we've seen Cosmo take more time than Gimlet for some things, but we also haven't had a chance to finish profiling to see whether this is expected. I think this will become more obvious as we start testing on actual hardware (thank you for the pointers on where to check!)

@jclulow (Collaborator) commented Nov 6, 2025

Unfortunately we've been blocked on the Cosmo device ID. Without a device ID we can't connect via sprockets, which means we can't run RSS, which means we can't run any actual services to run reconfigurator. This is also why I've let this PR sit here.

NB: I have just completed signing on the two Cosmos in question. They're now pending reinstallation into london; see oxidecomputer/meta#776!



Successfully merging this pull request may close these issues.

Successfully update a Cosmo via reconfigurator

4 participants